Introduction

The data has detailed attributes for every player registered in the latest edition of FIFA 19 database, obtained scraping the website “sofifa.com”. Each instance is a different player, and the attributes give basic information about the players and their football skills. Basic pre-processing was done and Goal Keepers were removed for this assignment.

Please look here for the original data overview and attributes’ descriptions:

And here to get a better view of the information:


First look

[Task 1]: Load the dataset, completing the code below (keep the dataframe name as fifa)

# Loading
fifa <- read.csv("/Users/felix/Dropbox/CMPT459/Programming_Assignments/Assignment1/fifa.csv")

[Checkpoint 1]: How many rows and columns exist?

cat(ifelse(all(dim(fifa) == c(16122, 68)), "Correct results!", "Wrong results.."))
## Correct results!

[Task 2]: Give a very brief overview of the types of each attribute and their values. HINT: Functions str, table, summary.

# Overview

# To show types of each attribute
str(fifa)
## 'data.frame':    16122 obs. of  68 variables:
##  $ ID                      : int  158023 20801 190871 192985 183277 177003 176580 155862 188545 182521 ...
##  $ Age                     : int  31 33 26 27 27 32 31 32 29 28 ...
##  $ Overall                 : int  94 94 92 91 91 91 91 91 90 90 ...
##  $ Potential               : int  94 94 93 92 91 91 91 91 90 90 ...
##  $ Value                   : Factor w/ 211 levels "€0","€1.1M","€1.2M",..: 17 190 19 13 208 179 196 153 190 188 ...
##  $ Wage                    : Factor w/ 144 levels "€0","€100K","€105K",..: 95 75 56 67 65 78 82 71 34 67 ...
##  $ Height                  : Factor w/ 20 levels "5'1","5'10","5'11",..: 9 14 11 3 10 10 12 12 12 12 ...
##  $ Weight                  : Factor w/ 53 levels "110lbs","115lbs",..: 22 33 18 20 24 16 36 32 30 26 ...
##  $ LS                      : int  90 94 87 85 86 80 92 76 90 81 ...
##  $ ST                      : int  90 94 87 85 86 80 92 76 90 81 ...
##  $ RS                      : int  90 94 87 85 86 80 92 76 90 81 ...
##  $ LW                      : int  94 92 92 90 92 88 91 73 86 84 ...
##  $ LF                      : int  95 93 92 90 91 87 92 74 89 85 ...
##  $ CF                      : int  95 93 92 90 91 87 92 74 89 85 ...
##  $ RF                      : int  95 93 92 90 91 87 92 74 89 85 ...
##  $ RW                      : int  94 92 92 90 92 88 91 73 86 84 ...
##  $ LAM                     : int  95 91 92 91 92 90 90 74 86 87 ...
##  $ CAM                     : int  95 91 92 91 92 90 90 74 86 87 ...
##  $ RAM                     : int  95 91 92 91 92 90 90 74 86 87 ...
##  $ LM                      : int  93 91 91 91 92 89 89 75 84 85 ...
##  $ LCM                     : int  86 84 84 90 85 91 84 78 80 89 ...
##  $ CM                      : int  86 84 84 90 85 91 84 78 80 89 ...
##  $ RCM                     : int  86 84 84 90 85 91 84 78 80 89 ...
##  $ RM                      : int  93 91 91 91 92 89 89 75 84 85 ...
##  $ LWB                     : int  66 68 68 80 69 85 74 84 64 82 ...
##  $ LDM                     : int  63 64 63 80 66 84 73 87 65 85 ...
##  $ CDM                     : int  63 64 63 80 66 84 73 87 65 85 ...
##  $ RDM                     : int  63 64 63 80 66 84 73 87 65 85 ...
##  $ RWB                     : int  66 68 68 80 69 85 74 84 64 82 ...
##  $ LB                      : int  61 64 63 76 63 82 71 87 61 80 ...
##  $ LCB                     : int  49 56 50 69 52 74 68 90 60 75 ...
##  $ CB                      : int  49 56 50 69 52 74 68 90 60 75 ...
##  $ RCB                     : int  49 56 50 69 52 74 68 90 60 75 ...
##  $ RB                      : int  61 64 63 76 63 82 71 87 61 80 ...
##  $ Crossing                : int  84 84 79 93 81 86 77 66 62 88 ...
##  $ Finishing               : int  95 94 87 82 84 72 93 60 91 76 ...
##  $ HeadingAccuracy         : int  70 89 62 55 61 55 77 91 85 54 ...
##  $ ShortPassing            : int  90 81 84 92 89 93 82 78 83 92 ...
##  $ Volleys                 : int  86 87 84 82 80 76 88 66 89 82 ...
##  $ Dribbling               : int  97 88 96 86 95 90 87 63 85 81 ...
##  $ Curve                   : int  93 81 88 85 83 85 86 74 77 86 ...
##  $ FKAccuracy              : int  94 76 87 83 79 78 84 72 86 84 ...
##  $ LongPassing             : int  87 77 78 91 83 88 64 77 65 93 ...
##  $ BallControl             : int  96 94 95 91 94 93 90 84 89 90 ...
##  $ Acceleration            : int  91 89 94 78 94 80 86 76 77 64 ...
##  $ SprintSpeed             : int  86 91 90 76 88 72 75 75 78 62 ...
##  $ Agility                 : int  91 87 96 79 95 93 82 78 78 70 ...
##  $ Reactions               : int  95 96 94 91 90 90 92 85 90 89 ...
##  $ Balance                 : int  95 70 84 77 94 94 83 66 78 71 ...
##  $ ShotPower               : int  85 95 80 91 82 79 86 79 88 87 ...
##  $ Jumping                 : int  68 95 61 63 56 68 69 93 84 30 ...
##  $ Stamina                 : int  72 88 81 90 83 89 90 84 78 75 ...
##  $ Strength                : int  59 79 49 75 66 58 83 83 84 73 ...
##  $ LongShots               : int  94 93 82 91 80 82 85 59 84 92 ...
##  $ Aggression              : int  48 63 56 76 54 62 87 88 80 60 ...
##  $ Interceptions           : int  22 29 36 61 41 83 41 90 39 82 ...
##  $ Positioning             : int  94 95 89 87 87 79 92 60 91 79 ...
##  $ Vision                  : int  94 82 87 94 89 92 84 63 77 86 ...
##  $ Penalties               : int  75 85 81 79 86 82 85 75 88 73 ...
##  $ Composure               : int  96 95 94 88 91 84 85 82 86 85 ...
##  $ Marking                 : int  33 28 27 68 34 60 62 87 34 72 ...
##  $ StandingTackle          : int  28 31 24 58 27 76 45 92 42 79 ...
##  $ SlidingTackle           : int  26 23 33 51 22 73 38 91 19 69 ...
##  $ Release.Clause          : Factor w/ 1200 levels "","€1.1M","€1.2M",..: 284 80 285 224 181 100 163 24 80 143 ...
##  $ Preferred.Foot          : Factor w/ 2 levels "Left","Right": 1 2 2 2 2 2 2 2 2 2 ...
##  $ Work.Rate               : Factor w/ 9 levels "HighHigh","HighLow",..: 9 2 3 1 3 1 3 3 3 9 ...
##  $ Position                : Factor w/ 26 levels "CAM","CB","CDM",..: 21 26 14 19 11 19 23 18 26 9 ...
##  $ International.Reputation: int  5 5 5 4 4 4 5 4 4 4 ...
# Overview of distribution of values for each attribute
summary(fifa)
##        ID              Age           Overall        Potential    
##  Min.   :    16   Min.   :16.00   Min.   :46.00   Min.   :48.00  
##  1st Qu.:200917   1st Qu.:21.00   1st Qu.:62.00   1st Qu.:67.00  
##  Median :222028   Median :25.00   Median :66.00   Median :71.00  
##  Mean   :215084   Mean   :25.01   Mean   :66.46   Mean   :71.52  
##  3rd Qu.:236638   3rd Qu.:28.00   3rd Qu.:71.00   3rd Qu.:75.00  
##  Max.   :246620   Max.   :41.00   Max.   :94.00   Max.   :95.00  
##                                                                  
##      Value            Wage          Height         Weight    
##  €1.1M  :  407   €1K    :4032   6'0    :2609   154lbs :1398  
##  €375K  :  314   €2K    :2503   5'10   :2444   165lbs :1368  
##  €325K  :  308   €3K    :1699   5'9    :2226   159lbs : 915  
##  €1.2M  :  306   €4K    :1156   5'11   :2054   161lbs : 896  
##  €425K  :  301   €5K    : 786   6'1    :1588   172lbs : 866  
##  €1M    :  298   €6K    : 622   6'2    :1531   163lbs : 857  
##  (Other):14188   (Other):5324   (Other):3670   (Other):9822  
##        LS              ST              RS              LW       
##  Min.   :33.00   Min.   :33.00   Min.   :33.00   Min.   :27.00  
##  1st Qu.:54.00   1st Qu.:54.00   1st Qu.:54.00   1st Qu.:55.00  
##  Median :60.00   Median :60.00   Median :60.00   Median :62.00  
##  Mean   :59.84   Mean   :59.84   Mean   :59.84   Mean   :61.06  
##  3rd Qu.:66.00   3rd Qu.:66.00   3rd Qu.:66.00   3rd Qu.:68.00  
##  Max.   :94.00   Max.   :94.00   Max.   :94.00   Max.   :94.00  
##                                                                 
##        LF              CF              RF              RW       
##  Min.   :29.00   Min.   :29.00   Min.   :29.00   Min.   :27.00  
##  1st Qu.:55.00   1st Qu.:55.00   1st Qu.:55.00   1st Qu.:55.00  
##  Median :62.00   Median :62.00   Median :62.00   Median :62.00  
##  Mean   :60.74   Mean   :60.74   Mean   :60.74   Mean   :61.06  
##  3rd Qu.:68.00   3rd Qu.:68.00   3rd Qu.:68.00   3rd Qu.:68.00  
##  Max.   :95.00   Max.   :95.00   Max.   :95.00   Max.   :94.00  
##                                                                 
##       LAM          CAM          RAM           LM             LCM       
##  Min.   :29   Min.   :29   Min.   :29   Min.   :29.00   Min.   :32.00  
##  1st Qu.:55   1st Qu.:55   1st Qu.:55   1st Qu.:56.00   1st Qu.:54.00  
##  Median :62   Median :62   Median :62   Median :63.00   Median :61.00  
##  Mean   :61   Mean   :61   Mean   :61   Mean   :61.76   Mean   :60.24  
##  3rd Qu.:68   3rd Qu.:68   3rd Qu.:68   3rd Qu.:68.00   3rd Qu.:66.00  
##  Max.   :95   Max.   :95   Max.   :95   Max.   :93.00   Max.   :91.00  
##                                                                        
##        CM             RCM              RM             LWB       
##  Min.   :32.00   Min.   :32.00   Min.   :29.00   Min.   :32.00  
##  1st Qu.:54.00   1st Qu.:54.00   1st Qu.:56.00   1st Qu.:53.00  
##  Median :61.00   Median :61.00   Median :63.00   Median :60.00  
##  Mean   :60.24   Mean   :60.24   Mean   :61.76   Mean   :59.53  
##  3rd Qu.:66.00   3rd Qu.:66.00   3rd Qu.:68.00   3rd Qu.:66.00  
##  Max.   :91.00   Max.   :91.00   Max.   :93.00   Max.   :88.00  
##                                                                 
##       LDM             CDM             RDM             RWB       
##  Min.   :30.00   Min.   :30.00   Min.   :30.00   Min.   :32.00  
##  1st Qu.:51.00   1st Qu.:51.00   1st Qu.:51.00   1st Qu.:53.00  
##  Median :60.00   Median :60.00   Median :60.00   Median :60.00  
##  Mean   :58.85   Mean   :58.85   Mean   :58.85   Mean   :59.53  
##  3rd Qu.:66.00   3rd Qu.:66.00   3rd Qu.:66.00   3rd Qu.:66.00  
##  Max.   :90.00   Max.   :90.00   Max.   :90.00   Max.   :88.00  
##                                                                 
##        LB             LCB              CB             RCB       
##  Min.   :31.00   Min.   :27.00   Min.   :27.00   Min.   :27.00  
##  1st Qu.:52.00   1st Qu.:48.00   1st Qu.:48.00   1st Qu.:48.00  
##  Median :60.00   Median :59.00   Median :59.00   Median :59.00  
##  Mean   :58.83   Mean   :57.64   Mean   :57.64   Mean   :57.64  
##  3rd Qu.:66.00   3rd Qu.:67.00   3rd Qu.:67.00   3rd Qu.:67.00  
##  Max.   :87.00   Max.   :90.00   Max.   :90.00   Max.   :90.00  
##                                                                 
##        RB           Crossing      Finishing     HeadingAccuracy
##  Min.   :31.00   Min.   :11.0   Min.   :10.00   Min.   :15.00  
##  1st Qu.:52.00   1st Qu.:44.0   1st Qu.:36.00   1st Qu.:49.00  
##  Median :60.00   Median :56.0   Median :52.00   Median :58.00  
##  Mean   :58.83   Mean   :54.2   Mean   :49.76   Mean   :57.07  
##  3rd Qu.:66.00   3rd Qu.:65.0   3rd Qu.:63.00   3rd Qu.:65.00  
##  Max.   :87.00   Max.   :93.0   Max.   :95.00   Max.   :94.00  
##                                                                
##   ShortPassing      Volleys        Dribbling         Curve      
##  Min.   :20.00   Min.   :10.00   Min.   :14.00   Min.   :11.00  
##  1st Qu.:57.00   1st Qu.:35.00   1st Qu.:55.00   1st Qu.:39.00  
##  Median :64.00   Median :47.00   Median :63.00   Median :52.00  
##  Mean   :62.64   Mean   :46.75   Mean   :60.59   Mean   :51.25  
##  3rd Qu.:69.00   3rd Qu.:58.00   3rd Qu.:69.00   3rd Qu.:63.00  
##  Max.   :93.00   Max.   :90.00   Max.   :97.00   Max.   :94.00  
##                                                                 
##    FKAccuracy     LongPassing    BallControl    Acceleration  
##  Min.   :10.00   Min.   :19.0   Min.   :25.0   Min.   :20.00  
##  1st Qu.:34.00   1st Qu.:49.0   1st Qu.:58.0   1st Qu.:62.00  
##  Median :44.00   Median :58.0   Median :64.0   Median :69.00  
##  Mean   :46.43   Mean   :56.1   Mean   :63.2   Mean   :67.92  
##  3rd Qu.:58.00   3rd Qu.:65.0   3rd Qu.:70.0   3rd Qu.:76.00  
##  Max.   :94.00   Max.   :93.0   Max.   :96.0   Max.   :97.00  
##                                                               
##   SprintSpeed       Agility        Reactions       Balance     
##  Min.   :25.00   Min.   :23.00   Min.   :21.0   Min.   :22.00  
##  1st Qu.:62.00   1st Qu.:59.00   1st Qu.:56.0   1st Qu.:60.00  
##  Median :69.00   Median :68.00   Median :62.0   Median :68.00  
##  Mean   :67.99   Mean   :66.43   Mean   :62.2   Mean   :66.59  
##  3rd Qu.:76.00   3rd Qu.:75.00   3rd Qu.:68.0   3rd Qu.:75.00  
##  Max.   :96.00   Max.   :96.00   Max.   :96.0   Max.   :96.00  
##                                                                
##    ShotPower       Jumping         Stamina         Strength    
##  Min.   :14.0   Min.   :28.00   Min.   :27.00   Min.   :25.00  
##  1st Qu.:51.0   1st Qu.:59.00   1st Qu.:61.00   1st Qu.:59.00  
##  Median :61.0   Median :67.00   Median :68.00   Median :67.00  
##  Mean   :59.6   Mean   :66.02   Mean   :67.34   Mean   :65.91  
##  3rd Qu.:69.0   3rd Qu.:74.00   3rd Qu.:75.00   3rd Qu.:75.00  
##  Max.   :95.0   Max.   :95.00   Max.   :96.00   Max.   :97.00  
##                                                                
##    LongShots       Aggression    Interceptions    Positioning   
##  Min.   :11.00   Min.   :13.00   Min.   :10.00   Min.   :11.00  
##  1st Qu.:40.00   1st Qu.:50.00   1st Qu.:34.00   1st Qu.:46.00  
##  Median :54.00   Median :61.00   Median :56.00   Median :57.00  
##  Mean   :51.44   Mean   :59.58   Mean   :50.43   Mean   :54.82  
##  3rd Qu.:64.00   3rd Qu.:70.00   3rd Qu.:65.00   3rd Qu.:65.00  
##  Max.   :94.00   Max.   :95.00   Max.   :92.00   Max.   :95.00  
##                                                                 
##      Vision        Penalties       Composure        Marking     
##  Min.   :12.00   Min.   :12.00   Min.   :22.00   Min.   :10.00  
##  1st Qu.:47.00   1st Qu.:42.00   1st Qu.:53.00   1st Qu.:37.00  
##  Median :57.00   Median :52.00   Median :61.00   Median :56.00  
##  Mean   :55.47   Mean   :52.13   Mean   :60.35   Mean   :51.31  
##  3rd Qu.:65.00   3rd Qu.:62.00   3rd Qu.:68.00   3rd Qu.:65.00  
##  Max.   :94.00   Max.   :92.00   Max.   :96.00   Max.   :94.00  
##                                                                 
##  StandingTackle  SlidingTackle   Release.Clause  Preferred.Foot
##  Min.   :10.00   Min.   :10.00          : 1379   Left : 4003   
##  1st Qu.:35.00   1st Qu.:32.00   €1.1M  :  504   Right:12119   
##  Median :59.00   Median :56.00   €1.3M  :  384                 
##  Mean   :51.92   Mean   :49.63   €1.2M  :  359                 
##  3rd Qu.:67.00   3rd Qu.:65.00   €1.4M  :  357                 
##  Max.   :93.00   Max.   :91.00   €1.5M  :  314                 
##                                  (Other):12825                 
##         Work.Rate       Position    International.Reputation
##  MediumMedium:7779   ST     :2152   Min.   :1.000           
##  HighMedium  :3170   CB     :1778   1st Qu.:1.000           
##  MediumHigh  :1690   CM     :1394   Median :1.000           
##  HighHigh    :1015   LB     :1322   Mean   :1.115           
##  MediumLow   : 849   RB     :1291   3rd Qu.:1.000           
##  HighLow     : 697   RM     :1124   Max.   :5.000           
##  (Other)     : 922   (Other):7061

[Checkpoint 2]: Were functions used to display data types and give some idea of the information of the attributes?


Data Cleaning

Functions suggested to use on this part: ifelse, substr, nchar, str_split, map_dbl.

Five attributes need to be cleaned.

[Task 3]: The first 3 of the 5 attributes listed above that need to be cleaned are very alike. Create only one function to clean them the same way. This function should get the vector of attribute values as parameter and return it cleaned, so use it three times, each with one of the columns. Encode zeroes or blank as NA.

# Function used to clean attributes
attr_fix <- function(attribute){
  
    cleaned_attribute <- as.character(attribute)
    
    # Modify euro symbol character
    ifelse(grepl("\u20ac",cleaned_attribute),
                                cleaned_attribute <- gsub('\u20ac',"",cleaned_attribute),
                                cleaned_attribute <- cleaned_attribute)
    
    # Modify M character
    for (i in grep("M",cleaned_attribute)){
      cleaned_attribute[i] <- gsub("M","",cleaned_attribute[i])
      cleaned_attribute[i] <- as.numeric(cleaned_attribute[i]) * 10^6
    }
    
    # Modify K character
    for (i in grep("K",cleaned_attribute)){
      cleaned_attribute[i] <- gsub("K","",cleaned_attribute[i])
      cleaned_attribute[i] <- as.numeric(cleaned_attribute[i]) * 10^3
    }
    
    #encode zeroes or blank as NA
    cleaned_attribute[cleaned_attribute==""] <- NA
    cleaned_attribute[cleaned_attribute=="0"] <- NA
    
    #make attribute numeric
    cleaned_attribute <- as.numeric(cleaned_attribute)
    
    return(cleaned_attribute)
}

# Cleaning attributes
fifa$Value <- attr_fix(fifa$Value)
fifa$Wage <- attr_fix(fifa$Wage)
fifa$Release.Clause <- attr_fix(fifa$Release.Clause)

[Checkpoint 3]: How many NA values?

cat(ifelse(sum(is.na(fifa))==1779, "Correct results!", "Wrong results.."))
## Correct results!

[Task 4]: Clean the other two attributes. Hint: To convert to “cm” use http://www.sengpielaudio.com/calculator-bodylength.htm.

# Cleaning attribute Weight:
weight_fix <- function(attribute){
  cleaned_attribute <- as.character(attribute)

  #remove "lbs"
  for (i in grep("lbs",cleaned_attribute)){
    cleaned_attribute[i] <- gsub("lbs","",cleaned_attribute[i])
  }
  
  cleaned_attribute <- as.numeric((cleaned_attribute))
  
  return(cleaned_attribute)
}
fifa$Weight <- weight_fix(fifa$Weight)
# Cleaning attribute Height:
height_fix <- function(attribute){
    cleaned_attribute = as.character(attribute)
    
    # for values with foot only
    # multiply feet by 30.48 to get cm
      for (i in !grep("\'",cleaned_attribute)){
        ifelse(cleaned_attribute[i]=="",
               cleaned_attribute[i] <- NA,
               as.numeric(cleaned_attribute[i])*30.48)
      }
    
    #for values with foot and inches
    for (i in grep("\'",cleaned_attribute)){
      feet <- as.numeric(unlist(strsplit(cleaned_attribute[i],"\'"))[1])
      inch <- as.numeric(unlist(strsplit(cleaned_attribute[i],"\'"))[2])
      # 1 foot is 30.48 cm
      # 1 inch is 2.54 cm
      cleaned_attribute[i] <- feet*30.48 + inch*2.54
    }
    
    cleaned_attribute <- as.numeric(cleaned_attribute)
    
    return(cleaned_attribute)
}
fifa$Height <- height_fix(fifa$Height)

[Checkpoint 4]: What are the mean values of these two columns?

cat(ifelse(all(c(round(mean(fifa[,8]),4)==164.1339, round(mean(fifa[,7]),4)==180.3887)), "Correct results!", "Wrong results.."))
## Correct results!

Missing Values

[Task 5]: What columns have missing values? List them below (Replace ). Impute (so do not remove) values missing (that is all NA found) and explain the reasons for the method used. Suggestion: MICE imputation based on random forests .R package mice: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/, Use set.seed(1). HINT: Remember to not use “ID” nor “International.Reputation” for the imputation, if MAR (Missing at Random) is considered. Also later remember to put them back to the “fifa” dataframe.

Columns with missing values:

#Get columns with missing values
columns_with_na <- colnames(fifa)[apply(fifa,2,anyNA)]

library(mice)
## Loading required package: lattice
## 
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
# Handling NA values
exclude <- c('ID','International.Reputation')
include <- setdiff(names(fifa),exclude)
to_impute <- fifa[include]
#MICE imputation based on random forests
imp_fifa <- mice (to_impute,m = 5,method ='rf',seed = 1)
## 
##  iter imp variable
##   1   1  Value  Wage  Release.Clause
##   1   2  Value  Wage  Release.Clause
##   1   3  Value  Wage  Release.Clause
##   1   4  Value  Wage  Release.Clause
##   1   5  Value  Wage  Release.Clause
##   2   1  Value  Wage  Release.Clause
##   2   2  Value  Wage  Release.Clause
##   2   3  Value  Wage  Release.Clause
##   2   4  Value  Wage  Release.Clause
##   2   5  Value  Wage  Release.Clause
##   3   1  Value  Wage  Release.Clause
##   3   2  Value  Wage  Release.Clause
##   3   3  Value  Wage  Release.Clause
##   3   4  Value  Wage  Release.Clause
##   3   5  Value  Wage  Release.Clause
##   4   1  Value  Wage  Release.Clause
##   4   2  Value  Wage  Release.Clause
##   4   3  Value  Wage  Release.Clause
##   4   4  Value  Wage  Release.Clause
##   4   5  Value  Wage  Release.Clause
##   5   1  Value  Wage  Release.Clause
##   5   2  Value  Wage  Release.Clause
##   5   3  Value  Wage  Release.Clause
##   5   4  Value  Wage  Release.Clause
##   5   5  Value  Wage  Release.Clause
## Warning: Number of logged events: 91
complete_fifa <- complete(imp_fifa,1)
# Putting columns not used on imputation back into "fifa" dataframe

fifa["Value"] <- complete_fifa["Value"]
fifa["Wage"] <- complete_fifa["Wage"]
fifa["Release.Clause"] <- complete_fifa["Release.Clause"]

#fifa <- cbind(complete_fifa),fifa[exclude])

[Checkpoint 5]: How many instances have at least one NA? It should be 0 now. How many columns are there? It should be 68 (remember to put back “ID” and “International.Reputation”).

cat(ifelse(all(sum(is.na(fifa))==0, ncol(fifa)==68), "Correct results!", "Wrong results.."))
## Correct results!

Feature Engineering

[Task 6]: Create a new attribute called “Position.Rating” that has the rating value of the position corresponding to the player. For example, if the player has the value “CF” on the attribute “Position”, then “Position.Rating” should have the number on the “CF” attribute. After that, remove the “Position” attribute from the data.

# Creating the attribute "Position.Rating"
# Iterating through all the rows
for (i in 1:nrow(fifa)){
    #get the rating value of the position
    position_name <- toString(fifa[i,67])
    #get the value of the position name
    val_position <- fifa[i,position_name]
    #Enter the value as the Position.Rating for that row
    fifa$Position.Rating[i] <- val_position
}
# Removing the attribute "Position"
#exclude the "Position" Column
exclude <- c('Position')
include <- setdiff(names(fifa),exclude)
#Include all columns except "Position"
fifa <- fifa[include]

[Checkpoint 6]: What’s the mean of the “Position.Rating” attribute created? How many columns are there in the dataframe? It should be 68 (remember to remove “Position”).

cat(ifelse(all(c(round(mean(fifa$Position.Rating),5) == 66.87067, ncol(fifa)==68)), "Correct results!", "Wrong results.."))
## Correct results!

Dimension Reduction

[Task 7]: Performe PCA (Principal Component Analysis) on the columns representing ratings of positions (that is, attributes: LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB). Show the summary of the components obtained. Keep the minimum number of components to have at least 98.50% of the variance explained by them.. Remove the columns used for PCA. HINT: Function prcomp, remember to center and scale.

# Perform PCA
#First rating of position is LS
#Last rating of position is RB
fifa.pca <- prcomp(fifa[grep("LS",colnames(fifa)):
                          grep("RB",colnames(fifa))],
                   center = TRUE, 
                   scale. = TRUE)

# Show Summary
summary(fifa.pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     4.0535 2.9757 0.65474 0.48078 0.15613 0.11051
## Proportion of Variance 0.6319 0.3406 0.01649 0.00889 0.00094 0.00047
## Cumulative Proportion  0.6319 0.9725 0.98900 0.99789 0.99883 0.99930
##                            PC7     PC8     PC9    PC10      PC11    PC12
## Standard deviation     0.08463 0.06869 0.06073 0.05198 9.543e-14 3.8e-15
## Proportion of Variance 0.00028 0.00018 0.00014 0.00010 0.000e+00 0.0e+00
## Cumulative Proportion  0.99957 0.99975 0.99990 1.00000 1.000e+00 1.0e+00
##                             PC13      PC14      PC15    PC16      PC17
## Standard deviation     1.311e-15 7.103e-16 5.991e-16 4.7e-16 3.668e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.0e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.0e+00 1.000e+00
##                             PC18      PC19      PC20      PC21      PC22
## Standard deviation     3.577e-16 3.577e-16 3.577e-16 3.577e-16 3.577e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                             PC23      PC24      PC25      PC26
## Standard deviation     3.577e-16 3.577e-16 2.135e-16 5.766e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00
#First 3 Components have at least 98.50% of the variance explained by them
# Put the components back into "fifa" dataframe
fifa <- cbind(fifa,fifa.pca$x)
#Only keep the first 3 principle components PC1 PC2 PC3
fifa <- fifa[,1:71]

# Remove original columns used for PCA
exclude <- colnames(fifa[9:34])
include <- setdiff(names(fifa),exclude)
fifa <- fifa[include]

[Checkpoint 7]: How many columns exist in the dataset? It should be 45.

cat(ifelse(ncol(fifa)==45, "Correct results!", "Wrong results.."))
## Correct results!

[Bonus]: Use the code below to see which columns influenced the most each component graphically. Replace “fifa.pca” with the object result from the use of prcomp function.

library(factoextra)
## Loading required package: ggplot2
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fviz_pca_var(fifa.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
)


Binarization

[Task 8]: Perform binarization on the following categorical attributes: “Preferred.Foot” and “Work.Rate”. HINT: R package “dummies”, function dummy.data.frame.

# Binarize categorical attributes
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
fifa <- dummy.data.frame(fifa,c("Preferred.Foot","Work.Rate"))

[Checkpoint 8]: How many columns exist in the dataset? It should be 54.

cat(ifelse(ncol(fifa)==54, "Correct results!", "Wrong results.."))
## Correct results!

Normalization

[Task 9]: Remove attribute “ID” from “fifa” dataframe, save attribute “International.Reputation” on vector named “IntRep” and then also remove “International.Reputation” from “fifa” dataframe. Perform z-score normalization on “fifa”, except for columns that came from PCA. Finally combine the normalized attributes with those from PCA saving on “fifa” dataframe. HINT: Function scale.

#Remove attribute "ID" from "fifa" dataframe
exclude <- c('ID')
include <- setdiff(names(fifa),exclude)
fifa <- fifa[include]

#save attribute "International.Reputation"" on vector named IntRep
IntRep <- fifa$International.Reputation

#Also remove "International.Reputation" from fifa dataframe
exclude <- c('International.Reputation')
include <- setdiff(names(fifa),exclude)
fifa <- fifa[include]
# Normalize with Z-Score
fifa_normalized <- scale(fifa[1:49],center = TRUE,scale=TRUE)

#Combine normalized attributes with those from PCA 
#saving on "fifa" dataframe
fifa <- cbind(fifa_normalized,fifa[50:52])

[Checkpoint 9]: How many columns exist in the dataset? It should be 52. What’s the mean of all the means of the attributes? Should be around zero.

cat(ifelse(ncol(fifa)==52, "Correct results!", "Wrong results.."))
## Correct results!

K-Means

[Task 9]: Perform K-Means for values of K ranging from 2 to 15. Find the best number of clusters for K-means clustering, based on the silhouette score. Report the best number of clusters and the silhouette score for the corresponding clustering (Replace below). How strong is the discovered cluster structure? (Replace below) Use “set.seed(1)”. HINT: Function kmeans (make use of parameters nstart and iter.max) and silhouette (from package “cluster”).

# K-Means and Silhouette scores
library(cluster)
library(purrr)

#getting distance for dataset
dist_fifa <- dist(fifa)
set.seed(1)

#avg_sil function from https://uc-r.github.io/kmeans_clustering#silo
avg_sil <- function(k) {
  km.res <- kmeans(fifa, centers = k,iter.max = 15, nstart = 25)
  ss <- silhouette(km.res$cluster, dist_fifa)
  mean(ss[, 3])
}

k.values <- 2:15

avg_sil_values <- map_dbl(k.values, avg_sil)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 806100)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 806100)
#based on code from https://uc-r.github.io/kmeans_clustering#silo
plot(k.values, avg_sil_values,
       type = "b", pch = 19, frame = FALSE, 
       xlab = "Number of clusters K",
       ylab = "Average Silhouettes")

#silhouette score for clusters:2(best number of clusters)
cat("silhouette score for 2clusters(best number of clusters): " ,avg_sil_values[1] ,"\n")
## silhouette score for 2clusters(best number of clusters):  0.2136819

Results found:

[Checkpoint 9]: Are there silhouette scores for K-Means with K ranging from 2 to 15? Were the best K and correspondent silhouette score reported?


[Task 10]: Perform K-means with the K chosen and get the resulting groups. Try out several pairs of attributes and produce scatter plots of the clustering from task 9 for these pairs of attributes. By inspecting these plots, determine a pair of attributes for which the clusters are relatively well-separated and submit the corresponding scatter plot.

# K-Means for best K and Plot
k_means_2cluster <- kmeans(fifa, centers = 2,iter.max = 15, nstart = 25)
summary(k_means_2cluster)
##              Length Class  Mode   
## cluster      16122  -none- numeric
## centers        104  -none- numeric
## totss            1  -none- numeric
## withinss         2  -none- numeric
## tot.withinss     1  -none- numeric
## betweenss        1  -none- numeric
## size             2  -none- numeric
## iter             1  -none- numeric
## ifault           1  -none- numeric
plot1 <- plot(fifa[c("Aggression","Positioning")],main="Clustering for Attributes Pair: Aggression and Positioning",col = k_means_2cluster$cluster)

print("Attribute Pair Reactions and Vision have clusters relatively well-seperated")
## [1] "Attribute Pair Reactions and Vision have clusters relatively well-seperated"
plot2 <- plot(fifa[c("Reactions","Vision")],main="Clustering for Attributes Pair: Reactions and Vision",col = k_means_2cluster$cluster)

plot3 <- plot(fifa[c("LongPassing","LongShots")],main="Clustering for Attributes Pair: LongPassing and LongShots",col = k_means_2cluster$cluster)

plot4 <- plot(fifa[c("HeadingAccuracy","Penalties")],main="Clustering for Attributes Pair: HeadingAccuracy and Penalties",col = k_means_2cluster$cluster)

plot5 <- plot(fifa[c("Balance","BallControl")],main="Clustering for Attributes Pair: Balance and BallControl",col = k_means_2cluster$cluster)

[Checkpoint 10]: Is there at least one plot showing two attributes and the groups (colored or circled) reasonably separated?


Hierarchical Clustering

[Task 11]: Sample randomly 1% of the data (set.seed(1)). Perform hierarchical cluster analysis on the dataset using the algorithms complete linkage, average linkage and single linkage. Plot the dendrograms resulting from the different methods (three methods should be applied on the same 1% sample). Discuss the commonalities and differences between the three dendrograms and try to explain the reasons leading to the differences (Replace the below).

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Sample and calculate distances
set.seed(1)
# Sample 1% of the dataset
fifa_sample <- sample_frac(fifa,0.01)
dist_sample <- dist(fifa_sample)
# Complete
hcluster_complete <- hclust(dist_sample,method = "complete")
plot(hcluster_complete,main="Complete Linkage Cluster Dendrogram",xlab = "Sample of Fifa Dataset")

# Average
hcluster_average <- hclust(dist_sample,method = "average")
plot(hcluster_average, main="Average Linkage Cluster Dendrogram",xlab = "Sample of Fifa Dataset")

# Single
hcluster_single <- hclust(dist_sample,method = "single")
plot(hcluster_single,main="Single Linkage Cluster Dendrogram",xlab = "Sample of Fifa Dataset")

Discussion:

Complete Linkage: The distance is defined as the farthest data point from one cluster to the farthest data point from the other cluster. So the clusters with the smallest diameters will merge together at each iteration.

Single Linkage: The distance is defined as the closest data point from one cluster to the closest data point from the other cluster.So the clusters with data points closest to datapoints of nearby clusters will merge together at each iteration.

Average Linkage: The distance is defined as an average distance of each point in one cluster to every point in the other cluster.

The dendrograms plotted illustrate the differences between the three algorithms. Single Linkage will merge the clusters with data points closest to each other so the dendrogram is more likely to form long chains.

Complete Linkage will merge clusters with big diameters later, so the dendrogram is more likely to have uniform sized trees.

Average Linkage is the middle ground between single linkage and complete linkage so the dendrogram looks in between the dendrograms for single and complete linkage.

[Checkpoint 11]: Does the discussion show commonalities and differences between the three dendrograms and explain the differences?


Clustering comparison

[Task 12]: Now perform hierarchical cluster analysis on the ENTIRE dataset using the algorithms complete linkage, average linkage and single linkage. Cut all of the three dendrograms from task 11 to obtain a flat clustering with the number of clusters determined as the best number in task 9.

To perform an external validation of the clustering results, use the vector “IntRep”" created. What is the Rand Index for the best K-means clustering? And what are the values of the Rand Index for the flat clusterings obtained in this task from complete linkage, average linkage and single linkage? Discuss the results (Replace below). HINT: Function cluster_similarity from package “clusteval”.

# Hierarchical Clusterings (Complete, Average and Single)

#Hierachical Clustering Complete Linkage
hclust_complete <- hclust(dist_fifa,method = "complete")

#Hierachical Clustering Average Linkage
hclust_average <- hclust(dist_fifa,method = "average")

#Hierachical Clustering Single Linkage
hclust_single <- hclust(dist_fifa,method = "single")
# Flat Clusterings

#Split into 2 clusters based on best cluster number for kmeans

#Complete Linkage Flat Clustering
flat_complete <- cutree(hclust_complete,2)

#Average Linkage Flat Clustering
flat_average <- cutree(hclust_average,2)

#Single Linkage Flat Clustering
flat_single <- cutree(hclust_single,2)
# Cluster Similarities
library(clusteval)

#Rand Index for Best K-means clustering (2 Clusters)
rand_kmeans <- cluster_similarity(IntRep,k_means_2cluster$cluster,similarity = "rand")
cat("Rand Index for Best K-means clustering (2 Clusters): ", rand_kmeans ,"\n")
## Rand Index for Best K-means clustering (2 Clusters):  0.4999695
#Rand Index for Complete Linkage Flat Clustering
rand_flat_complete <- cluster_similarity(IntRep,flat_complete,similarity = "rand")
cat("Rand Index for Complete Linkage Flat Clustering: " ,rand_flat_complete ,"\n")
## Rand Index for Complete Linkage Flat Clustering:  0.836581
#Rand Index for Average Linkage Flat Clustering
rand_flat_average <- cluster_similarity(IntRep,flat_average,similarity = "rand")
cat("Rand Index for Average Linkage Flat Clustering: ",rand_flat_average ,"\n")
## Rand Index for Average Linkage Flat Clustering:  0.8356267
#Rand Index for Single Linkage Flat Clustering
rand_flat_single <- cluster_similarity(IntRep,flat_single,similarity = "rand")
cat("Rand Index for Single Linkage Flat Clustering: ", rand_flat_single ,"\n")
## Rand Index for Single Linkage Flat Clustering:  0.8274063

Discussion:

[Checkpoint 12]: Does the discussion include relevant comparison of the clusters and makes sense?